I. Introduction

Diabetes is a chronic disease whose prevalence has increased rapidly in the past few decades. According to the report from WHO, in 2014, the prevalence of diabetes among adult population had risen from 4.7% to 8.5% comparing to that in 1980 (WHO, 2016). It is projected as the seventh leading cause of mortality in 2030 (Mathers & Loncar, 2006).

In the U.S., the data of 2015 shows that 7.2% of the population are diagnosed with Diabetes(CDC, 2017). In 2013, it ranked as the seventh leading cause of death in the nation (CDC). Previous studies identified obesity as the main risk factor of type 2 diabetes (which account for 95% cases of diabetes in the U.S.)(Smyth & Heron, 2006). Meanwhile, consumption of artificial sweetened beverage(Koning et al, 2011) and prenatal exposure to famine under certain circumstances(Lumey, Khalangot & Vaiserman, 2015) are both related to the increased prevalence of type 2 diabetes. The report from CDC also demonstrates that there are certain association between socioeconomic status and the prevalence(CDC, 2017).

Inspiring by these findings which are all somehow related to food environment, we set the goal to explore the association between prevalence of diabetes in United States and food related factors, including median household income, participation proportion of National School Lunch Program (NSLP)/School Breakfast Program (SBP), average number of supermarket and grocery stores/1,000 residents, fast-food restaurants availability, etc. The research focuses on population scale where we explore the relationship between stately diabetes rate and multiple factors of the state.

II. Data and Method

The main dataset we use is Food Environment Atlas which can be found on data.gov.The detailed description of the dataset can be found on the Github page of the project. The detailed description of the dataset can be found on the Github page of the project, named archived_documentation_August2015.pdf. The dataset contains several spreadsheets which provide information on population characteristics and food factors of each state at the county level, which are averaged to be at state level.Some main variables that we are interested in includ:

In addition to this main dataset, we also use another four datasets, Median Household Income (In 2013 Inflation-adjusted Dollars) by State Ranked from Highest to Lowest Using 3-Year Average (2011-2013), Median Household Income (In 2008 Inflation-adjusted Dollars) by State Ranked from Highest to Lowest Using 3-Year Average: 2006-2008 from census.gov., by Centers for Disease Control and Prevention

The two median household income datasets contain the information of median household income of each state (50 states and DC) in 2008 and 2013 collected by U.S. Census Bureau. The income values are evened in a three-year-period (ie. 2006-2008 for 2008, and 2011-2013 for 2013) and adjusted for inflation. These two datasets have been merged into one dataframe based on states, and the names of states are converted to abbreviation in order to be compared with the rest dataframes.

All datasets are downloaded directly from the websites as Excel or CSV format.

health = 
  readxl::read_xls('../data/food_enviroment_atlas.xls', sheet = 'HEALTH') %>%
  clean_names() %>%
  select(1:7)

assistance = 
  readxl::read_xls('../data/food_enviroment_atlas.xls', sheet = 'ASSISTANCE') %>%
  clean_names() %>%
  select(1:3, 23:24, 30:31)

restaurant = 
  readxl::read_xls('../data/food_enviroment_atlas.xls', sheet = 'RESTAURANTS') %>%
  clean_names() %>%
  select(1:9, 16:17)

store = 
  readxl::read_xls('../data/food_enviroment_atlas.xls', sheet = 'STORES') %>%
  clean_names() %>%
  select(1:27)

#diabetes rate of each state in 2009
diab2009 =  readr::read_csv('../data/diabetes2009.csv') %>%
  clean_names() %>% 
  select(1:2) %>% 
  mutate(state = state.abb[match(state,state.name)],
         year = "2009")
## Parsed with column specification:
## cols(
##   State = col_character(),
##   Percentage = col_double(),
##   `Lower Limit` = col_double(),
##   `Upper Limit` = col_double()
## )
diab2009$state[9] = "DC"

#diabetes rate of each state in 2014
diab2014 =  readr::read_csv('../data/diabetes2014.csv') %>%
  clean_names() %>% 
  select(1:2) %>% 
  mutate(state = state.abb[match(state,state.name)],
         year = "2014")
## Parsed with column specification:
## cols(
##   State = col_character(),
##   Percentage = col_double(),
##   `Lower Limit` = col_double(),
##   `Upper Limit` = col_double()
## )
diab2014$state[9] = "DC"

medhincome_08 = readxl::read_xls("../data/mhi_08.xls", range = "A13:B63",col_names = c("state","medhhinc08")) %>%
  clean_names() %>%
  mutate(state = state.abb[match(state,state.name)])
medhincome_08$state[17] = "DC"

medhincome_13 = readxl::read_xls("../data/mhi_13.xls", range = "A11:B61",col_names = c("state","medhhinc13")) %>%
  clean_names() %>%
  mutate(state = state.abb[match(state,state.name)])
medhincome_13$state[8] = "DC"
medhincome_13$`income level` = 
  ifelse(medhincome_13$medhhinc13 < 47242.68,"below average",
         ifelse(medhincome_13$medhhinc13 < 58380.28, "middle income", "above average"))

medhincome = merge(medhincome_08, medhincome_13,by=c("state"))

III. Results

Overview

Considering the high correlation between diabetes and obesity, we first explore the association between the two factors. Then we explore the relationship between diabetes with other possible predictors, such as:

  • socioeconomic status
  • Number of fast-food restaurants
  • National School Lunch Program (NSLP)/School Breakfast Program (SBP)
  • number of supermarket & grocery stores/1,000 residents

Linear regression is used in most models and a comprehensive multiple linear regression model was built.

Diabetes rate vs Obesity

We first take a look at the distribution of diabetes and obesity in United States in 2013. In the following maps, deeper color represents higher rates of diabetes or obesity. The maps illustrates the uneven distribution of stately disease rate and the similar pattern between diabetes rate and obesity rate. Southeastern part has higher diabetes and obesity rates than other parts of the country. Colorado has both very low diabetes and obesities rate.

health_map = 
  health %>%
  group_by(state) %>%
  summarise(diabetes = mean(pct_diabetes_adults13),
            obesity = mean(pct_obese_adults13)) %>%
  ungroup() %>%
  na.omit() %>%
  mutate(full_state = state.name[match(state, state.abb)],
         hover = with(., paste(full_state, '<br>')))

geo_info <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

plot_geo(health_map, locationmode = 'USA-states') %>%
add_trace(
  z = ~diabetes, text = ~hover, locations = ~state,
  color = ~diabetes, colors = 'OrRd'
) %>%
colorbar(title = "Diabetes Rate (%)") %>%
layout(
  title = 'Adult Diabetes Rate, 2013',
  geo = geo_info
  )

plot_geo(health_map, locationmode = 'USA-states') %>%
add_trace(
  z = ~obesity, text = ~hover, locations = ~state,
  color = ~obesity, colors = 'YlGnBu'
) %>%
colorbar(title = "Obesity Rate (%)") %>%
layout(
  title = 'Adult Obesity Rate, 2013',
  geo = geo_info
  )

We then take a look at the distribution of diabetes and obesity nationally. Paired t-test results indicates that both diabetes and obesity rate significantlly increase from 2008 () to 2013. This conclusion is consist with our literature research. Diabtes and obesity rates keep rising in the past decades in USA.(Mokdad A H, et al., 2001)

require(gridExtra)
t_test_diabetes = 
  health %>%
  ggplot() +
  geom_density(aes(x = pct_obese_adults08), color = "#56B4E9", fill ="#56B4E9", alpha = .4) +
  geom_density(aes(x = pct_obese_adults13), color = "#D55E00",fill ="#D55E00", alpha = .4) +
  labs(title = 'National obesity rate in 08 and 13',
       x = 'rate of obesity(%)')

t_test_obesity = 
  health %>%
  ggplot() +
  geom_density(aes(x = pct_diabetes_adults08), color = "#56B4E9", fill ="#56B4E9", alpha = .4) +
  geom_density(aes(x = pct_diabetes_adults13), color = "#D55E00",fill ="#D55E00", alpha = .4) +
  labs(title = 'National obesites rate in 08 and 13',
       x = 'rate of diabetes(%)')

grid.arrange(t_test_diabetes, t_test_obesity, ncol=2)


t.test(health$pct_obese_adults08, health$pct_obese_adults13, paired = TRUE) 
## 
##  Paired t-test
## 
## data:  health$pct_obese_adults08 and health$pct_obese_adults13
## t = -45.534, df = 3136, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.175697 -1.996060
## sample estimates:
## mean of the differences 
##               -2.085878
t.test(health$pct_diabetes_adults08, health$pct_diabetes_adults13, paired = TRUE)
## 
##  Paired t-test
## 
## data:  health$pct_diabetes_adults08 and health$pct_diabetes_adults13
## t = -58.258, df = 3136, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.374594 -1.285081
## sample estimates:
## mean of the differences 
##               -1.329837

Finding the similar distribution of diabetes and obesity, we continue to test the relationship between diabetes and obesity in 2008 and 2013 respectively. Using the linear regression, we find that there are significant linear relationship between diabetes and obesity both in 2008 and 2013, where about 50% of the relationship could be explained by the linear model. This conclusion is consistent with other researches.(Nguyen, et al., 2011)

require(gridExtra)

relation_2008 = 
  health %>%
  ggplot() +
  geom_point(aes(x = pct_obese_adults08, y = pct_diabetes_adults08), color="#56B4E9", alpha = 0.5)+
  geom_smooth(aes(x = pct_obese_adults08, y = pct_diabetes_adults08), color="#0072B2", method = 'lm') +
  labs(title = 'Relation in 2008',
       x = 'obesity percentage',
       y = 'diabetes percentage')

relation_2013 = 
  health %>%
  ggplot() +
  geom_point(aes(x = pct_obese_adults13, y = pct_diabetes_adults13), color="#F78725", alpha = 0.5)+
  geom_smooth(aes(x = pct_obese_adults13, y = pct_diabetes_adults13), color="#D55E00", method = 'lm') +
  labs(title = 'Relation in 2013',
       x = 'obesity percentage',
       y = 'diabetes percentage')
grid.arrange(relation_2008, relation_2013, ncol=2)


lm(pct_diabetes_adults08~pct_obese_adults08, data = health) %>% summary()
## 
## Call:
## lm(formula = pct_diabetes_adults08 ~ pct_obese_adults08, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3246 -0.9389  0.0177  0.9388  5.1372 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.670956   0.200017  -8.354   <2e-16 ***
## pct_obese_adults08  0.400413   0.006857  58.391   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.425 on 3136 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.5209, Adjusted R-squared:  0.5207 
## F-statistic:  3409 on 1 and 3136 DF,  p-value: < 2.2e-16
lm(pct_diabetes_adults13~pct_obese_adults13, data = health) %>% summary()
## 
## Call:
## lm(formula = pct_diabetes_adults13 ~ pct_obese_adults13, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2119 -1.1838 -0.0385  1.1507  7.6523 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.514395   0.222657   -2.31   0.0209 *  
## pct_obese_adults13  0.378840   0.007103   53.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.801 on 3140 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4753, Adjusted R-squared:  0.4751 
## F-statistic:  2844 on 1 and 3140 DF,  p-value: < 2.2e-16

Diabetes rate vs Social-economics Status

In this section, the relationship between diabetes and social-economic status is explored.

diabetes = health %>%
  group_by(state) %>%
  summarise(pct_08 = mean(pct_diabetes_adults08),
            pct_13 = mean(pct_diabetes_adults13)) %>%
  gather(key = year, value = pct_diabetes_adults, pct_08:pct_13)%>%
  separate(year, into = c("remove", "year"), sep = "_") %>%
  select(-remove)

obese = health %>%
  group_by(state) %>%
  summarise(pct_08 = mean(pct_obese_adults08),
            pct_13 = mean(pct_obese_adults13)) %>%
  gather(key = year, value = pct_obese_adults, pct_08:pct_13)%>%
  separate(year, into = c("remove", "year"), sep = "_") %>%
  select(-remove)

health_status = merge(diabetes, obese, by=c("state", "year"))

income = medhincome %>%
  select(-`income level`) %>%
  gather(key = year, value = medhhinc, medhhinc08:medhhinc13)%>%
  separate(year, into = c("remove", "year"), sep = -3) %>%
  select(-remove)

eco_health = merge(health_status, income, by=c("state", "year"))

eco_health %>%
  group_by(year) %>%
  ggplot(aes(x = medhhinc, y = pct_diabetes_adults)) +
  geom_point(aes(color = year), size = 3, alpha = .8) +
  geom_smooth(se = FALSE) + 
  facet_grid(. ~ year) +
  labs(title = "Median household income and adult diabetes rate relationship",
    x = "Median household income(Dollars)",
    y = "Percentage of adult diabetes")  
## `geom_smooth()` using method = 'loess'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).


summary(eco_health %>% 
          filter(year == "13", medhhinc < 60000) %>%
          lm(pct_diabetes_adults ~ medhhinc, data = .))
## 
## Call:
## lm(formula = pct_diabetes_adults ~ medhhinc, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5219 -0.3773  0.1015  0.6120  2.7064 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.422e+01  1.815e+00  13.345 6.41e-16 ***
## medhhinc    -2.648e-04  3.644e-05  -7.267 1.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.254 on 38 degrees of freedom
## Multiple R-squared:  0.5815, Adjusted R-squared:  0.5705 
## F-statistic:  52.8 on 1 and 38 DF,  p-value: 1.075e-08

summary(eco_health %>% 
          filter(year == "13", medhhinc >= 60000) %>%
          lm(pct_diabetes_adults ~ medhhinc, data = .))
## 
## Call:
## lm(formula = pct_diabetes_adults ~ medhhinc, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56643 -0.86159  0.01438  0.65555  2.10019 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -6.5529967  7.0811187  -0.925   0.3818  
## medhhinc     0.0002405  0.0001103   2.179   0.0609 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.18 on 8 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3725, Adjusted R-squared:  0.2941 
## F-statistic:  4.75 on 1 and 8 DF,  p-value: 0.06092

Scatterplots are made to show the association between median household income and diabetes rate in the U.S. in both 2008 and 2013. Overall, we can observe that the adult diabetes rate increased from 08 to 13, which matches the finding in the previous section. Comparing the two plots, we can observe a similar pattern: when the median household income is below 60,000 dollars, there is a decreasing trend; when the median household income is above 60,000 dollars, there is an increasing trend.

Two separate linear models are fitted regarding the 60,000 dollars threshold for data in 2013. The one for below 60,000 dollars is proved to be significant, with a \(R^{2}\) of 0.58, indicating that 58% variation in the adult diabetes rate can be explained by the median household income. However, there is not statistical significant linear relationship between diabetes rate and median household income when the income is above 60,000 dollars. The result indicates that social-economic status may be a confounder to the relationship between diabetes and other factors.

Diabetes Rate vs School Breakfast/Lunch Program Participant Propotion

In the following section, these two relationships are analyzed:

  • The relationship between diabetes rate (2013) and average School Breakfast program (SBP) participation rate(%) (average between 2009 and 2015).
  • The relationship between diabetes rate (2013) and National School Lunch Program (NSLP) participation rate(%) (average between 2009 and 2015).
participation_rate = assistance %>% 
  group_by(state) %>% 
  mutate(`School Lunch Program` = mean(pct_nslp09, na.rm = T) + mean(pct_nslp15, na.rm = T)) %>% 
  mutate(`School Breakfast Program` = mean(pct_sbp09, na.rm = T) + mean(pct_sbp15, na.rm = T)) %>% 
  ungroup(state) %>% 
  select(state,`School Lunch Program`,`School Breakfast Program`) %>% 
  filter(!(duplicated(state)))

health_13 = health %>% 
  group_by(state) %>% 
  mutate(diabetes_13 = mean(pct_diabetes_adults13, na.rm = T),
         obesites_13 = mean(pct_obese_adults13, na.rm = T)) %>% 
  mutate(statefips = str_sub(fips, 1, 2))%>% 
  filter(!duplicated(state)) %>% 
  ungroup(state) %>%
  select(state, diabetes_13, obesites_13) %>% 
  left_join(participation_rate, by = "state") %>% 
  left_join(medhincome_13, by = "state") %>% 
  gather(key = "program", value = "participant proportion", `School Lunch Program`:`School Breakfast Program`)

ggplot(health_13,aes(x = `participant proportion`, y = diabetes_13)) +
  geom_point(aes(color = `income level`)) +
  geom_smooth(method = "lm",formula = y~x) +
     ggpmisc::stat_poly_eq(formula = y~x,
            aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
            parse = TRUE) +
  labs(title = "Diabetes Rate vs. Program Participant",
    x = "Participant Proportion(%)",
    y = "Diabetes Rate(%)") +
  theme(legend.position = "bottom") +
  facet_grid(. ~program, scales = "free")

Note: the Income level refers to the level of median household income of each state. Above average income level is defined as median household income above 75% quantile of all the states. Below average income level is defined as median household income below 25% quantile of all the states. Middle income is between 25% quantile and 75% quantile.

The linear relationship between diabetes rate (2013) and SBP/NSLP are both significant with adjusted \(R^{2}\) of 40% and 26% respectively. The results indicate that there is a positive relationship.

From the scatter plot, we can see that most states with below average median household income (green dots) are above the regression line and most states with above average household income (red dots) are below the line. Therefore, income level may differ the relationship between participant proportion and diabetes rate. In those states with lower average household income, for every unit increase in participant proportion, the expected increase in diabetes rate may be higher than states with higher household income.

Diabetes Rate vs Average Supermarkets & Grocery Stores number/1,000 residents

Do linear regression between diabetes rate within state and average number of supermarket & grocery store number/1000 residents in each state (average over each county) in 2009 and 2014 (due to limit of data resource).

health_store = store %>% 
  group_by(state) %>% 
  mutate(store_rate_2009 = mean(grocpth09, na.rm = T),
         store_rate_2014 = mean(grocpth14, na.rm = T),
         store_mean = (store_rate_2009+store_rate_2014)/2) %>% 
  select(state,store_rate_2009,store_rate_2014,store_mean) %>% 
  filter(!(duplicated(state)))

health_store1 = health_store %>% 
  select(-store_mean) %>% 
  left_join(medhincome_13, by = "state") %>% 
  gather(key = "year", value = "store_rate", store_rate_2009:store_rate_2014) %>% 
  mutate(year = str_sub(year, -4, -1)) %>% 
  left_join(rbind(diab2009,diab2014), by = c("state","year")) %>% 
  ggplot(aes(x = store_rate, y = percentage)) +
     geom_point(aes(color = `income level`), alpha = 0.7) +
     geom_smooth(method = "lm", formula = y~x) +
     ggpmisc::stat_poly_eq(formula = y~x,
            aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
            parse = TRUE) +
   coord_cartesian(xlim = c(0.15, 0.5)) +
   labs(title = "",
    x = "Average supermarket & grocery store/1000 residents",
    y = "Diabetes Rate(%)") +
    facet_grid(. ~ year)

health_store1  

Note: the Income level refers to the level median household income of each state. Above average income level is defined as median household income above 75% quantile of all the states. Below average income level is defined as median household income below 25% quantile of all the states. Middle income is between 25% quantile and 75% quantile.

Overall, there is a negative linear relationship between diabetes rate and average number of supermarket & grocery stores/1000 residents with \(R^2\) 0.22 and 0.3 respectively. The negative association is consistant with the negative relationship between diabetes rate and food accessibility. Lower accessibility to grocery stores and farmers’ market is relateved to higher diabetes and obesitiy rate (Salois, 2012).

States with below average household income are clustered at lower average number of supermarke & grocery stores/1000 residents and high Diabetes Rate, which is consistant with previous findings. In addition, obesity may be a confounder in the diabetes rate vs. store number/1000 residents relationship, since lower food accessibility are both correlated with diabetes and obesity.

A multiple linear model that capture the overall association

After identifying the food related factors that may have certain relationship with adult diabetes prevalence, we decide to construct an explanatory linear model to capture the association as a whole. Since in previous section, we observed a strong linear association between median household income and adult diabetes rate when the income is below 60,000 dollars, we focus on building a model to explain the association between the covariates identified from previous sections and adult diabetes rate in this sub-population.

The variables we are interested in to build the model for adult diabetes rate at the state level include:

  • adult obesity rate (obe)
  • median household income (MHI)
  • lunch program participation rate (LPPR)
  • the number of Fast-food restaurants per 1,000 population (FFPP)
  • the Expenditures per capita of fast food (EFF)
  • the number of grocery store and supermarket per 1,000 population (store)
store_data = health_store %>%
  select(state, store_rate_2014)
lunch_data = health_13[c(1:51),]
colnames(compare1)[1:3] = c("state", "diabetes_13", "obesites_13")
model_data = merge(lunch_data, compare1, by = c("state", "diabetes_13", "obesites_13")) %>%
  merge(.,store_data, by = "state") %>%
  select(-`income level`, - program, - ffr14, - state) %>% 
  filter(medhhinc13 < 60000)

# correlation of variables
correlation_data = cor(model_data)

# find best model
## stepwise and backward
mult.fit.full <- lm(diabetes_13 ~ ., data=model_data)
step(mult.fit.full, direction='both')
## Start:  AIC=-28.13
## diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     ffrpth14 + pc_ffrsales12 + store_rate_2014
## 
##                            Df Sum of Sq    RSS     AIC
## - ffrpth14                  1    0.3610 14.314 -29.105
## <none>                                  13.953 -28.127
## - `participant proportion`  1    1.0352 14.988 -27.264
## - medhhinc13                1    2.4376 16.391 -23.686
## - pc_ffrsales12             1    2.5394 16.493 -23.439
## - store_rate_2014           1    2.6924 16.646 -23.069
## - obesites_13               1   24.3011 38.254  10.215
## 
## Step:  AIC=-29.11
## diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     pc_ffrsales12 + store_rate_2014
## 
##                            Df Sum of Sq    RSS      AIC
## <none>                                  14.314 -29.1051
## - `participant proportion`  1    0.9749 15.289 -28.4696
## + ffrpth14                  1    0.3610 13.953 -28.1268
## - medhhinc13                1    2.2730 16.587 -25.2101
## - pc_ffrsales12             1    2.4868 16.801 -24.6977
## - store_rate_2014           1    4.4128 18.727 -20.3566
## - obesites_13               1   25.3428 39.657   9.6556
## 
## Call:
## lm(formula = diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     pc_ffrsales12 + store_rate_2014, data = model_data)
## 
## Coefficients:
##              (Intercept)               obesites_13  
##                2.653e+00                 3.958e-01  
##               medhhinc13  `participant proportion`  
##               -6.359e-05                -6.442e-02  
##            pc_ffrsales12           store_rate_2014  
##                3.364e-03                -4.494e+00
step(mult.fit.full, direction='back')
## Start:  AIC=-28.13
## diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     ffrpth14 + pc_ffrsales12 + store_rate_2014
## 
##                            Df Sum of Sq    RSS     AIC
## - ffrpth14                  1    0.3610 14.314 -29.105
## <none>                                  13.953 -28.127
## - `participant proportion`  1    1.0352 14.988 -27.264
## - medhhinc13                1    2.4376 16.391 -23.686
## - pc_ffrsales12             1    2.5394 16.493 -23.439
## - store_rate_2014           1    2.6924 16.646 -23.069
## - obesites_13               1   24.3011 38.254  10.215
## 
## Step:  AIC=-29.11
## diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     pc_ffrsales12 + store_rate_2014
## 
##                            Df Sum of Sq    RSS      AIC
## <none>                                  14.314 -29.1051
## - `participant proportion`  1    0.9749 15.289 -28.4696
## - medhhinc13                1    2.2730 16.587 -25.2101
## - pc_ffrsales12             1    2.4868 16.801 -24.6977
## - store_rate_2014           1    4.4128 18.727 -20.3566
## - obesites_13               1   25.3428 39.657   9.6556
## 
## Call:
## lm(formula = diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     pc_ffrsales12 + store_rate_2014, data = model_data)
## 
## Coefficients:
##              (Intercept)               obesites_13  
##                2.653e+00                 3.958e-01  
##               medhhinc13  `participant proportion`  
##               -6.359e-05                -6.442e-02  
##            pc_ffrsales12           store_rate_2014  
##                3.364e-03                -4.494e+00

## criterion based
best <- function(model, ...) 
{
  subsets <- leaps::regsubsets(formula(model), model.frame(model), ...)
  subsets <- with(summary(subsets),
                  cbind(p = as.numeric(rownames(which)), which, rss, rsq, adjr2, cp, bic))
  
  return(subsets)
}  

# Select the 'best' 1 model of all subsets
round(best(mult.fit.full, nbest = 1), 4)
##   p (Intercept) obesites_13 medhhinc13 `participant proportion` ffrpth14
## 1 1           1           1          0                        0        0
## 2 2           1           1          0                        0        0
## 3 3           1           1          1                        0        0
## 4 4           1           1          1                        0        0
## 5 5           1           1          1                        1        0
## 6 6           1           1          1                        1        1
##   pc_ffrsales12 store_rate_2014     rss    rsq  adjr2      cp      bic
## 1             0               0 36.2046 0.7464 0.7398 49.6256 -47.5066
## 2             0               1 21.1216 0.8521 0.8441 15.9536 -65.3733
## 3             0               1 16.8895 0.8817 0.8719  7.9444 -70.6287
## 4             1               1 15.2891 0.8929 0.8807  6.1594 -70.9218
## 5             1               1 14.3142 0.8997 0.8850  5.8538 -69.8684
## 6             1               1 13.9532 0.9023 0.8845  7.0000 -67.2012

best_fit = lm(formula = diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
    pc_ffrsales12 + store_rate_2014, data = model_data)

best_fit %>% summary()
## 
## Call:
## lm(formula = diabetes_13 ~ obesites_13 + medhhinc13 + `participant proportion` + 
##     pc_ffrsales12 + store_rate_2014, data = model_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50748 -0.35640  0.09374  0.34144  1.34475 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.653e+00  2.682e+00   0.989  0.32951    
## obesites_13               3.958e-01  5.102e-02   7.759 5.01e-09 ***
## medhhinc13               -6.359e-05  2.737e-05  -2.324  0.02626 *  
## `participant proportion` -6.442e-02  4.234e-02  -1.522  0.13733    
## pc_ffrsales12             3.364e-03  1.384e-03   2.430  0.02051 *  
## store_rate_2014          -4.494e+00  1.388e+00  -3.238  0.00269 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6488 on 34 degrees of freedom
## Multiple R-squared:  0.8997, Adjusted R-squared:  0.885 
## F-statistic: 61.03 on 5 and 34 DF,  p-value: 5.274e-16

Based on the result of both stepwise regression, backward elimination (based on AIP), and criterion-based procedures, we get the following model:

\[Y_{diabetes\ rate} = 2.653 + 0.3958 \cdot X_{obe} - 0.00006 \cdot X_{MHI}- 0.06442 \cdot X_{LPPR} + 0.00337 \cdot X_{EFF} - 4.494 \cdot X_{store}\]

check = influence.measures(best_fit)
# par(mar=c(4,4,1,1))
# par(mfrow=c(2,2))
# plot(best_fit)

After checking the model assumptions and influential values, we conclude that this model is a good one to capture the association. The adjusted \(R^2\) is 0.885, indicating that 88.5% variation in adult obesity rate can be explained by this linear model.

IV. Discussion

Major Findings

Most of our results (the association between diabetes rate and obesity, social-economic factors and food enviroment related factors) are conformed with previous studies. There is a strong association between diabetes and obesity and a strong association between diabetes and social-economic status. When the median household income is below 60,000 dollars, there is a decreasing trend between income and diabetes rate, while when the income is above, 60,000 dollars, the trend turns to be positive. As for the restaurant-related factors, combining the number of restaurants per 1000 population with the expenditures per capita on fast food, there is a linear relationship with diabetes rates. Surprisingly, there is a positive relationship between diabetes rate and National School Lunch Program/School Breakfast Program participation proportion, which is counterintuitive – NSLP and SBP serve nutritious food to children and should negatively correlated with diabetes rate It may be because the association is confounded by other factors such as social-economic factors because children who are eligible for NSLP and SBP tend to come from families with low income. In addition, there is a negative association between diabetes rate and average supermarkets & grocery stores number/1,000 residents.

Based on the result after exploring the one by one relationship betweent adult diabetes rate in the U.S. and each food environment related vairable, a final explanatory linear model is built as: \[Y_{diabetes\ rate} = 2.653 + 0.3958 \cdot X_{obe} - 0.00006 \cdot X_{MHI}- 0.06442 \cdot X_{LPPR} + 0.00337 \cdot X_{EFF} - 4.494 \cdot X_{store}\] As explained in the result section, this is a good model to capture the relashionship between food envirement factors and U.S. adult diabetes rate based on the adjusted \(R^2\) value and the p value of each parameter.

Weaknesses

We mostly used linear regression to detect the relationship, which is limited when exploring association with higher complexity. Even if we explored the relationship with several single factor and then combine them into a full model, the confounding effect has not been discussed in detail. In addition, the main dataset is incomplete with only data of diabetes rate in 2008 and 2013, which causes the limitation of our findings.

The final model we construted is not a prediction model, but a explanatory model. This means we cannot using food environment factors to predict the U.S. adult diabetes rate. One main reason that we did not construt a prediction model is the mismatch of year for different variable. Moreover, since we did the analysis at the state level, some meaningful pattern at county level may be diluted as we simply even the variable among all the counties without considering the population in each county. With higher quility data, further study may try to build a more complex model to achieve the goal of prediction.

Reference